In [44]:
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import numpy as np
import seaborn as sns
In [2]:
vehicle = pd.read_csv('vehicle.csv')
In [3]:
vehicle.shape
Out[3]:
(846, 19)
In [4]:
print('The vehicle data provided has 19 columns and 846 rows(observations)')
The vehicle data provided has 19 columns and 846 rows(observations)
In [5]:
vehicle.head()
Out[5]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio class
0 95 48.0 83.0 178.0 72.0 10 162.0 42.0 20.0 159 176.0 379.0 184.0 70.0 6.0 16.0 187.0 197 van
1 91 41.0 84.0 141.0 57.0 9 149.0 45.0 19.0 143 170.0 330.0 158.0 72.0 9.0 14.0 189.0 199 van
2 104 50.0 106.0 209.0 66.0 10 207.0 32.0 23.0 158 223.0 635.0 220.0 73.0 14.0 9.0 188.0 196 car
3 93 41.0 82.0 159.0 63.0 9 144.0 46.0 19.0 143 160.0 309.0 127.0 63.0 6.0 10.0 199.0 207 van
4 85 44.0 70.0 205.0 103.0 52 149.0 45.0 19.0 144 241.0 325.0 188.0 127.0 9.0 11.0 180.0 183 bus
In [6]:
vehicle.isnull().values.any()
Out[6]:
True
In [7]:
print('There seems null values in the data, which we have to treat')
There seems null values in the data, which we have to treat

As the number of NA values observations are very less, we will drop Rows with NA values in them in any of the columns

In [8]:
vehicle1= vehicle.dropna(axis =0, how ='any')
In [9]:
vehicle1.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 813 entries, 0 to 845
Data columns (total 19 columns):
compactness                    813 non-null int64
circularity                    813 non-null float64
distance_circularity           813 non-null float64
radius_ratio                   813 non-null float64
pr.axis_aspect_ratio           813 non-null float64
max.length_aspect_ratio        813 non-null int64
scatter_ratio                  813 non-null float64
elongatedness                  813 non-null float64
pr.axis_rectangularity         813 non-null float64
max.length_rectangularity      813 non-null int64
scaled_variance                813 non-null float64
scaled_variance.1              813 non-null float64
scaled_radius_of_gyration      813 non-null float64
scaled_radius_of_gyration.1    813 non-null float64
skewness_about                 813 non-null float64
skewness_about.1               813 non-null float64
skewness_about.2               813 non-null float64
hollows_ratio                  813 non-null int64
class                          813 non-null object
dtypes: float64(14), int64(4), object(1)
memory usage: 127.0+ KB
In [10]:
vehicle_null2 = vehicle1.isnull().sum()
vehicle_null2
Out[10]:
compactness                    0
circularity                    0
distance_circularity           0
radius_ratio                   0
pr.axis_aspect_ratio           0
max.length_aspect_ratio        0
scatter_ratio                  0
elongatedness                  0
pr.axis_rectangularity         0
max.length_rectangularity      0
scaled_variance                0
scaled_variance.1              0
scaled_radius_of_gyration      0
scaled_radius_of_gyration.1    0
skewness_about                 0
skewness_about.1               0
skewness_about.2               0
hollows_ratio                  0
class                          0
dtype: int64

Now we have 813 rows with all acceptble values

In [11]:
vehicle1.describe()
Out[11]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
count 813.000000 813.000000 813.00000 813.000000 813.000000 813.000000 813.000000 813.00000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000 813.000000
mean 93.656827 44.803198 82.04305 169.098401 61.774908 8.599016 168.563346 40.98893 20.558426 147.891759 188.377614 438.382534 174.252153 72.399754 6.351784 12.687577 188.979090 195.729397
std 8.233751 6.146659 15.78307 33.615402 7.973000 4.677174 33.082186 7.80338 2.573184 14.504648 31.165873 175.270368 32.332161 7.475994 4.921476 8.926951 6.153681 7.398781
min 73.000000 33.000000 40.00000 104.000000 47.000000 2.000000 112.000000 26.00000 17.000000 118.000000 130.000000 184.000000 109.000000 59.000000 0.000000 0.000000 176.000000 181.000000
25% 87.000000 40.000000 70.00000 141.000000 57.000000 7.000000 146.000000 33.00000 19.000000 137.000000 167.000000 318.000000 149.000000 67.000000 2.000000 6.000000 184.000000 191.000000
50% 93.000000 44.000000 79.00000 167.000000 61.000000 8.000000 157.000000 43.00000 20.000000 146.000000 179.000000 364.000000 173.000000 71.000000 6.000000 11.000000 189.000000 197.000000
75% 100.000000 49.000000 98.00000 195.000000 65.000000 10.000000 198.000000 46.00000 23.000000 159.000000 217.000000 586.000000 198.000000 75.000000 9.000000 19.000000 193.000000 201.000000
max 119.000000 59.000000 112.00000 333.000000 138.000000 55.000000 265.000000 61.00000 29.000000 188.000000 320.000000 1018.000000 268.000000 135.000000 22.000000 41.000000 206.000000 211.000000
In [12]:
sns.pairplot(vehicle1, diag_kind = 'kde', hue = 'class')
C:\Users\hp\Anaconda3\lib\site-packages\scipy\stats\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.
  return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval
Out[12]:
<seaborn.axisgrid.PairGrid at 0x2099bed1128>

Observations from pairplot: There are minimum 3 Gaussian in the pairplot. That means we have minimum 3 clusters in the data, which is clear from case problem.

There is multi collinearity in the data. Many variables are very correlated.

Splitting data in to Independant and Dependant variable

In [13]:
X = vehicle1.drop('class', axis=1)
y = vehicle1['class']

Checking for multicollinear independant variable in the given data

In [14]:
corr_X = X.corr(method ='pearson')
mask = np.zeros_like(corr_X)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr_X,cmap = 'RdYlGn_r',vmax = 1.0, vmin = -1.0, mask = mask, linewidths=2.5)
plt.yticks(rotation =0)
plt.xticks(rotation = 90)
plt.show()

We can see there are multiple variables that are collienear and might not add value in defining the classes

We can drop: compactness, circulatory, distance_circularity, scatter_ratio, elongatedness, pr.axis_rectangularity, max.length_rectangularity, scaled_variance, scaled_variance.1, scaled_radius_of_gyration, scaled_radius_of_gyration

These variables are highly collienear have correlation more than 0.8 and -0.8

Transforming data in to Z-score

In [15]:
from scipy.stats import zscore
XScale = X.apply(zscore)
XScale.describe()
Out[15]:
compactness circularity distance_circularity radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio scatter_ratio elongatedness pr.axis_rectangularity max.length_rectangularity scaled_variance scaled_variance.1 scaled_radius_of_gyration scaled_radius_of_gyration.1 skewness_about skewness_about.1 skewness_about.2 hollows_ratio
count 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02
mean -2.425284e-16 -4.642999e-16 -2.351542e-16 3.665238e-16 -2.047016e-16 -1.349201e-16 3.714399e-16 -2.062038e-16 -2.651972e-16 -7.398756e-16 -3.031605e-17 -3.550529e-17 3.348422e-16 1.312159e-16 -1.693329e-17 9.845889e-17 -1.169490e-15 3.482249e-16
std 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00
min -2.510344e+00 -1.921444e+00 -2.665447e+00 -1.937757e+00 -1.854258e+00 -1.411767e+00 -1.710835e+00 -1.922008e+00 -1.383740e+00 -2.062109e+00 -1.874279e+00 -1.452266e+00 -2.019423e+00 -1.793474e+00 -1.291420e+00 -1.422141e+00 -2.110457e+00 -1.992013e+00
25% -8.089782e-01 -7.819133e-01 -7.635057e-01 -8.363933e-01 -5.992534e-01 -3.420870e-01 -6.824590e-01 -1.024408e+00 -6.060138e-01 -7.513773e-01 -6.863524e-01 -6.872619e-01 -7.815035e-01 -7.227236e-01 -8.847879e-01 -7.496057e-01 -8.096219e-01 -6.396066e-01
50% -7.982157e-02 -1.307527e-01 -1.929234e-01 -6.246222e-02 -9.725132e-02 -1.281510e-01 -3.497491e-01 2.578765e-01 -2.171510e-01 -1.305046e-01 -3.010789e-01 -4.246486e-01 -3.875161e-02 -1.873484e-01 -7.152328e-02 -1.891593e-01 3.400092e-03 1.718371e-01
75% 7.708611e-01 6.831980e-01 1.011639e+00 7.710020e-01 4.047507e-01 2.997208e-01 8.903515e-01 6.425619e-01 9.494376e-01 7.663115e-01 9.189540e-01 8.427456e-01 7.349483e-01 3.480268e-01 5.384252e-01 7.075550e-01 6.538177e-01 7.127995e-01
max 3.079857e+00 2.311100e+00 1.899212e+00 4.878790e+00 9.566288e+00 9.926837e+00 2.916857e+00 2.565989e+00 3.282615e+00 2.766901e+00 4.225885e+00 3.309026e+00 2.901308e+00 8.378655e+00 3.181535e+00 3.173519e+00 2.767675e+00 2.065206e+00

Dropping the highly correlated variables

In [16]:
XScale = XScale.drop(['compactness','circularity','distance_circularity','scatter_ratio','elongatedness','pr.axis_rectangularity','max.length_rectangularity','scaled_variance','scaled_variance.1','scaled_radius_of_gyration','scaled_radius_of_gyration.1'], axis =1)
#XScale = XScale.drop('circularity', axis =1)
#XScale = XScale.drop('distance_circularity', axis =1)
#XScale = XScale.drop('scatter_ratio', axis =1)
#XScale = XScale.drop('elongatedness', axis =1)
#XScale = XScale.drop('pr.axis_rectangularity', axis =1)
#XScale = XScale.drop('max.length_rectangularity', axis =1)
#XScale = XScale.drop('scaled_variance', axis =1)
#XScale= XScale.drop('scaled_variance.1', axis =1)
#XScale = XScale.drop('scaled_radius_of_gyration', axis =1)
#XScale = XScale.drop('scaled_radius_of_gyration.1', axis =1)
In [17]:
XScale.describe()
Out[17]:
radius_ratio pr.axis_aspect_ratio max.length_aspect_ratio skewness_about skewness_about.1 skewness_about.2 hollows_ratio
count 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02 8.130000e+02
mean 3.665238e-16 -2.047016e-16 -1.349201e-16 -1.693329e-17 9.845889e-17 -1.169490e-15 3.482249e-16
std 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00 1.000616e+00
min -1.937757e+00 -1.854258e+00 -1.411767e+00 -1.291420e+00 -1.422141e+00 -2.110457e+00 -1.992013e+00
25% -8.363933e-01 -5.992534e-01 -3.420870e-01 -8.847879e-01 -7.496057e-01 -8.096219e-01 -6.396066e-01
50% -6.246222e-02 -9.725132e-02 -1.281510e-01 -7.152328e-02 -1.891593e-01 3.400092e-03 1.718371e-01
75% 7.710020e-01 4.047507e-01 2.997208e-01 5.384252e-01 7.075550e-01 6.538177e-01 7.127995e-01
max 4.878790e+00 9.566288e+00 9.926837e+00 3.181535e+00 3.173519e+00 2.767675e+00 2.065206e+00

Split train and test data

In [18]:
# Split X and y in to train and test set in the ratio 70:30
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(XScale,y, test_size=.30, random_state=1)

Training the SVM using the train data

In [19]:
from sklearn import svm
clr = svm.SVC()  
clr.fit(X_train , y_train)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[19]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [20]:
y_pred = clr.predict(X_test)
In [23]:
clr.score(X_train, y_train)
Out[23]:
0.9209138840070299
In [24]:
clr.score(X_test , y_test)
Out[24]:
0.8688524590163934

It seems our model is overfitting- It has accuracy of 92 % for train data and 87% on test data

Performing K-fold validation and getting the cross validation score for our dataset

In [66]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 20
seed = 7

kfold = KFold(n_splits=num_folds, random_state=seed)
model = clr
results = cross_val_score(model, XScale, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
[0.87804878 0.92682927 0.95121951 0.95121951 0.92682927 0.95121951
 0.80487805 0.87804878 0.90243902 0.87804878 0.92682927 0.92682927
 0.92682927 0.9        0.875      0.825      0.925      0.875
 0.85       0.875     ]
Accuracy: 89.771% (4.030%)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
In [ ]:
 

We can observe that the Overall average accuracy on data is coming out to be 89.8%. Here we have taken 20 folds for K-Folds.

The model has run 20 iterations on our data and calculated the accuracies in each iteration for each sample. The average is 89.78%, and we can assume that there are high possibilities that our model will perform with same accuracy in production aswell.

In [ ]:
 
In [ ]:
 

PCA and extraction of Principal components

In [39]:
# generating the covariance matrix and the eigen values for the PCA analysis
from sklearn.decomposition import PCA
cov_matrix = np.cov(XScale.T) # the relevanat covariance matrix
print('Covariance Matrix \n%s', cov_matrix)

pca = PCA(n_components=7)
pca.fit(XScale)

#generating the eigen values and the eigen vectors
#e_vals, e_vecs = np.linalg.eig(cov_matrix)
#print('Eigenvectors \n%s' %e_vecs)
#print('\nEigenvalues \n%s' %e_vals)
Covariance Matrix 
%s [[ 1.00123153  0.66819724  0.45301698  0.04474816  0.17829807  0.37605357
   0.47147529]
 [ 0.66819724  1.00123153  0.6528959  -0.05931667 -0.04081886  0.22998448
   0.25788318]
 [ 0.45301698  0.6528959   1.00123153  0.01648166  0.04126053 -0.03058065
   0.13945419]
 [ 0.04474816 -0.05931667  0.01648166  1.00123153 -0.02263933  0.11127169
   0.0982493 ]
 [ 0.17829807 -0.04081886  0.04126053 -0.02263933  1.00123153  0.07803801
   0.20153412]
 [ 0.37605357  0.22998448 -0.03058065  0.11127169  0.07803801  1.00123153
   0.89515759]
 [ 0.47147529  0.25788318  0.13945419  0.0982493   0.20153412  0.89515759
   1.00123153]]
Out[39]:
PCA(copy=True, iterated_power='auto', n_components=7, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

We can see there are 7 Principal COmponents generated for 7 Independant variables

In [40]:
# the "cumulative variance explained" analysis 
print(pca.explained_variance_)
print(pca.components_)
[2.71095449 1.56169131 1.03014435 0.94405785 0.44273669 0.24814369
 0.0708923 ]
[[ 0.50561997  0.45171642  0.34117602  0.05408612  0.12481326  0.42385075
   0.47512791]
 [ 0.16001899  0.42758162  0.52834796 -0.18816748 -0.16238042 -0.50760435
  -0.4392946 ]
 [-0.05294247  0.07482203  0.04555886  0.66866967 -0.73086364  0.09000123
  -0.01192209]
 [ 0.07528521 -0.10870412  0.16979219  0.70847467  0.61429557 -0.24292863
  -0.12400175]
 [-0.66405635 -0.16031983  0.66423396 -0.05389805  0.03437706  0.12185502
   0.27052639]
 [-0.5073182   0.72760426 -0.29714232  0.09434842  0.20667055  0.20028325
  -0.18220593]
 [ 0.10921998 -0.20149665  0.20986197 -0.02838272  0.05137159  0.66378596
  -0.6777696 ]]
In [41]:
print(pca.explained_variance_ratio_)
[0.38680285 0.22282434 0.14698247 0.13469952 0.0631703  0.0354055
 0.01011502]
In [47]:
# Plotting the variance expalained by the principal components and the cumulative variance explained.
plt.bar(list(range(1,8)),pca.explained_variance_ratio_,alpha=0.5, align='center')
plt.ylabel('Variation explained')
plt.xlabel('eigen Value')
plt.show()
In [49]:
plt.step(list(range(1,8)),np.cumsum(pca.explained_variance_ratio_), where='mid')
plt.ylabel('Cum of variation explained')
plt.xlabel('eigen Value')
plt.show()

Selecting Eigen Vectors that expain 95% variance in data : as seen above upto 5 EV we select for 95% variance

In [50]:
pca5 = PCA(n_components=5)
pca5.fit(XScale)
print(pca5.components_)
[[ 0.50561997  0.45171642  0.34117602  0.05408612  0.12481326  0.42385075
   0.47512791]
 [ 0.16001899  0.42758162  0.52834796 -0.18816748 -0.16238042 -0.50760435
  -0.4392946 ]
 [-0.05294247  0.07482203  0.04555886  0.66866967 -0.73086364  0.09000123
  -0.01192209]
 [ 0.07528521 -0.10870412  0.16979219  0.70847467  0.61429557 -0.24292863
  -0.12400175]
 [-0.66405635 -0.16031983  0.66423396 -0.05389805  0.03437706  0.12185502
   0.27052639]]

Data with reduced dimension

In [51]:
Xpca5 = pca5.transform(XScale)
In [52]:
Xpca5
Out[52]:
array([[ 0.80361767,  0.79048606, -0.25455511,  0.16561963, -0.15871031],
       [-0.40524068, -0.66597988,  0.2508966 ,  0.43289342,  0.80457591],
       [ 0.92474619,  0.41439515,  1.31756194,  0.9645766 , -0.78216814],
       ...,
       [ 1.38452333,  1.05336819,  0.21740729, -0.92040049, -0.5106564 ],
       [-0.55245981, -0.51535886, -1.87070617, -0.15359803,  0.4158962 ],
       [-1.89857257, -0.29560584, -1.22319542, -0.3329546 ,  0.34656481]])

Developing model with Principal Components

Split train and test data for Principal Components

In [53]:
# Split X and y in to train and test set in the ratio 70:30
from sklearn.model_selection import train_test_split

Xpca_train, Xpca_test, ypca_train, ypca_test = train_test_split(Xpca5,y, test_size=.30, random_state=1)

Training SVM using the PCA train data

In [54]:
from sklearn import svm
clpca = svm.SVC()  
clpca.fit(Xpca_train , ypca_train)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
Out[54]:
SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)
In [57]:
clpca.score(Xpca_train, ypca_train)
Out[57]:
0.8418277680140598
In [58]:
clpca.score(Xpca_test, ypca_test)
Out[58]:
0.7663934426229508

We can see that the model accuracy decreased when we used selected Principal components, this is because we have dropped some principal components and hence the variance define by our selected components is limited now(95%)

In [ ]:
 

Performing K-fold validation and getting the cross validation score for our reduced dimension dataset

In [64]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

num_folds = 20
seed = 7

kfold = KFold(n_splits=num_folds, random_state=seed)
model = clpca
results_cv = cross_val_score(model, Xpca5, y, cv=kfold)
print(results)
print("Accuracy: %.3f%% (%.3f%%)" % (results_cv.mean()*100.0, results_cv.std()*100.0))
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
[0.7804878  0.82926829 0.80487805 0.80487805 0.73170732 0.90243902
 0.70731707 0.7804878  0.85365854 0.73170732 0.7804878  0.85365854
 0.75609756 0.85       0.725      0.8        0.825      0.775
 0.875      0.775     ]
Accuracy: 79.710% (5.177%)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)
C:\Users\hp\Anaconda3\lib\site-packages\sklearn\svm\base.py:196: FutureWarning: The default value of gamma will change from 'auto' to 'scale' in version 0.22 to account better for unscaled features. Set gamma explicitly to 'auto' or 'scale' to avoid this warning.
  "avoid this warning.", FutureWarning)

We can observe that the Overall average accuracy on data is coming out to be 79.7%. Here we have taken 20 folds for K-Folds.

The model has run 20 iterations on our data and calculated the accuracies in each iteration for each sample. The average is 79.7%, and we can assume that there are high possibilities that our model will perform with same accuracy in production aswell.

Analysis on resuls with and wwithout PCA and with and withou K Fold Validation:

In [60]:
print('Accuracy on test data with SVM with origial independant data set:')
print(clr.score(X_test , y_test))
print('Accuracy on test data with SVM with reduced dimensions (5 Principal Components) independant data set:')
print(clpca.score(Xpca_test, ypca_test))
Accuracy on test data with SVM with origial independant data set:
0.8688524590163934
Accuracy on test data with SVM with reduced dimensions (5 Principal Components) independant data set:
0.7663934426229508

We can see that the model accuracy decreased when we used selected Principal components, this is because we have dropped some principal components and hence the variance define by our selected components is limited now(95%)

In [69]:
print('Accuracy on original data with SVM using K fold validation set:')
print("Accuracy: %.3f%% (%.3f%%)" % (results.mean()*100.0, results.std()*100.0))
print('Accuracy on data with SVM with reduced dimensions (5 Principal Components) independant data set and using K fold Validaion:')
print("Accuracy: %.3f%% (%.3f%%)" % (results_cv.mean()*100.0, results_cv.std()*100.0))
Accuracy on original data with SVM using K fold validation set:
Accuracy: 89.771% (4.030%)
Accuracy on data with SVM with reduced dimensions (5 Principal Components) independant data set and using K fold Validaion:
Accuracy: 79.710% (5.177%)

We can observe that the Overall average accuracy on data is increased by using K fold validaion. Here we have taken 20 folds for K-Folds.

The model has run 20 iterations on our data and calculated the accuracies in each iteration for each sample. The average of the accurcies for each iteration is taken, and we can assume that there are high possibilities that our model will perform with same accuracy in production aswell.

Though it is observalble that with Principal COmponents, the accuracy loers because of reduced dimensions. But that is not a big issue if we want to work on just 5 Independant variables compared to 19 that we have had in our provided data set. So in this analysis we have initially reduced the independant variable by removing highly correlated variables and then by using Principal Component Ananlysis. This kidn of reducton may help in avoiding the curse of dimensionality, overfitting and reducing the cost of training and running model.

In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]:
 
In [ ]: